fix: replace premature media file cleanup with background MediaCleaner #620

ex-takashima wants to merge 5 commits into `sipeed:main`
Conversation
I have some concerns about introducing a global background MediaCleaner goroutine as the primary mechanism for file cleanup. From a resource lifecycle perspective, it's generally safer to follow an ownership principle: the component that creates a resource should also be responsible for cleaning it up. In the previous implementation, cleanup was explicit (e.g. via `defer os.Remove(...)`) and tightly coupled to the request/event lifecycle. That made ownership and timing clear. With the current approach:

- Cleanup is now time-based (e.g. "older than 30 minutes") rather than usage-based.
- The deletion logic is detached from the code that creates and consumes the file.
- Lifecycle becomes implicit and harder to reason about.

Time-based cleanup is not equivalent to "no longer in use." If any processing chain legitimately exceeds the configured threshold (slow downstream services, retries, backpressure, long-running tasks), files could be deleted while still logically in use. Even if this avoids premature deletion in some race conditions, it replaces a deterministic lifecycle with a heuristic one.

I understand the motivation behind avoiding premature cleanup, but I think we can achieve that without shifting ownership to a global janitor goroutine. Some possible alternatives:

- Have `DownloadFile` (or equivalent) return a cleanup `func()` along with the path. The caller invokes it once processing is truly complete. This keeps ownership explicit while still allowing flexible timing.
- Organize temp files under a per-request/per-session directory, and remove the entire directory once the request lifecycle ends. This keeps cleanup scoped and easier to reason about.
- Keep MediaCleaner only as a safety net (e.g. for orphaned files after crashes), rather than as the primary cleanup mechanism. Explicit cleanup should remain the normal path; the background cleaner should be a fallback.

In short, I think we should prefer deterministic, ownership-based cleanup over time-based sweeping. The latter can be a useful safeguard, but it shouldn't replace explicit lifecycle management.
Thanks for the thoughtful review, @Zhaoyikaiii! These are valid concerns, and I agree that ownership-based cleanup is generally preferable. The challenge here is the async boundary: the channel handler that downloads a file returns before the async agent loop consumes it, and there is currently no signal telling the handler when processing is complete. This means the alternatives run into the same fundamental issue.

Adding such a signal (e.g., a completion callback in the message bus, or reference counting on media files) would be a more significant architectural change and is probably beyond the scope of this bug fix. That said, the 30-minute threshold is intentionally conservative: typical message processing completes in seconds. But I understand the concern about it being heuristic rather than deterministic. Happy to discuss further or hold this PR if a broader architectural approach is preferred.
Thanks for the clarification — I agree the missing "processing complete" signal across `PublishInbound`/agent loop makes deterministic cleanup hard right now. Given that, could we make the MediaCleaner behavior configurable, so deployments can choose their trade-off explicitly? For example, an on/off switch plus tunable max-age and sweep-interval settings.

This keeps the current approach as a pragmatic fix, while letting operators tune or disable it if their workloads have longer async processing times or they prefer external temp management.
Good suggestion, @Zhaoyikaiii — just pushed an update that makes MediaCleaner fully configurable:

```json
{
  "tools": {
    "media_cleanup": {
      "enabled": true,
      "max_age_minutes": 30,
      "interval_minutes": 5
    }
  }
}
```

The same settings are also supported via environment variables, and the effective config is now logged at startup for easier debugging. Operators can disable cleanup entirely or tune the thresholds for their specific deployment. (Note: the remaining CI lint failure is pre-existing and unrelated to this change.)
Perhaps you could try rebasing to resolve the CI issue; the latest code in the repository should pass CI.
Channel handlers delete downloaded media files via `defer os.Remove()` before the async agent loop consumer can read them, causing silent failures in voice/image/document processing.

Remove the immediate defer cleanup from the telegram, line, slack, and onebot handlers. Introduce a background MediaCleaner goroutine that periodically removes files older than 30 minutes from the temp media directory, preventing unbounded accumulation. discord.go is unchanged because it processes files synchronously.

Closes sipeed#619

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix gci errors caused by lost indentation when removing localFiles lines, and use 0o700/0o600 octal syntax for gofumpt compliance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add MediaCleanupConfig with enabled, max_age_minutes, and interval_minutes fields. MediaCleaner is only started when enabled (default: true) and logs its effective settings at startup. Defaults: interval=5m, max_age=30m. Operators can tune or disable cleanup for deployments with longer async processing times. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 6e499d6 to 0e5d30a
Thanks @alexhoshina — just rebased onto latest main; the pre-existing lint failure should be resolved now.
nikolasdehor left a comment:
Good fix for a real race condition. The approach is sound: remove the premature defer os.Remove() from channel handlers and replace with a background cleaner that uses file age as the deletion criterion. This ensures the agent loop has time to consume the file.
Code quality observations (non-blocking):
- **Cleanup on first tick only:** The cleaner runs `cleanup()` only on ticker ticks, not immediately at start. This means files from a previous crash could sit for up to `interval` minutes before the first sweep. Consider running `mc.cleanup()` once at the beginning of `loop()` before entering the ticker loop. Minor, since the default interval is only 5 minutes.
- **Symlink safety:** `cleanup()` iterates directory entries and calls `os.Remove()` on regular files. If a symlink were placed in the media directory (e.g., by a malicious tool call or skill), `os.Remove` would remove the symlink itself (not the target), which is safe. Good.
- **No recursive directory cleanup:** The cleaner skips `entry.IsDir()`, which is correct since `DownloadFile` only creates flat files. But if a future change nests media in subdirs, those would accumulate. Acceptable for now.
- **Config placement:** `MediaCleanupConfig` is nested under `ToolsConfig`, which is slightly odd since media cleanup is a cross-cutting concern not specific to tools. But it matches the existing `ToolsConfig` as a grab-bag for non-channel config, so it is fine.
- Test coverage is appropriate for the cleaner logic. The channel handler diffs are purely mechanical removals, which is clean.
LGTM. This correctly fixes the race condition from #619 without introducing unbounded temp file growth.
Thanks for the thorough review, @nikolasdehor! Heads-up: since this PR, the approach has evolved — @alexhoshina asked me to integrate the cleanup directly, and this PR (#620) will likely be superseded by #728 once it's merged. Your observations about first-tick cleanup and config placement are noted and still relevant there.
Closing in favor of #728, which was merged into `main`.
Summary
Fixes #619
This addresses the media file race condition originally identified by @DevEverything01 in #543 — channel handlers delete downloaded temp files via `defer os.Remove()` before the async agent loop consumer can read them, causing silent failures in voice/image/document processing.

Rather than simply removing the cleanup (which would cause unbounded temp file accumulation), this PR:

- Removes `defer os.Remove()` from the 4 affected channel handlers (`telegram`, `line`, `slack`, `onebot`)
- Introduces a background `MediaCleaner` goroutine that scans `os.TempDir()/picoclaw_media/` every 5 minutes and removes files older than 30 minutes
- Adds a `MediaDir` constant to avoid hardcoded strings

`discord.go` is intentionally unchanged — it processes voice files synchronously before returning, so there is no race condition.

Changes
- `pkg/utils/media.go`: new `MediaDir` const, `MediaCleaner` struct with `Start()`/`Stop()`
- `pkg/utils/media_test.go`: new tests for the cleaner
- `pkg/channels/telegram.go`: removed `localFiles` tracking and `defer os.Remove` block
- `pkg/channels/line.go`: removed `localFiles` tracking and `defer os.Remove` block
- `pkg/channels/slack.go`: removed `localFiles` tracking and `defer os.Remove` block
- `pkg/channels/onebot.go`: removed `LocalFiles` field, `localFiles` tracking, and `defer os.Remove` block
- `cmd/picoclaw/cmd_gateway.go`: `MediaCleaner` wired into gateway lifecycle

Test plan
- `go vet ./pkg/utils/ ./pkg/channels/`: clean
- `go build ./pkg/utils/ ./pkg/channels/`: clean
- `go test ./pkg/utils/... -run TestMediaCleaner -v`: 2/2 PASS

🤖 Generated with Claude Code